On the Connection between Pre-training Data Diversity and Fine-tuning Robustness
Pre-training has been widely adopted in deep learning to improve model
performance, especially when the training data for a target task is limited. In
our work, we seek to understand the implications of this training strategy on
the generalization properties of downstream models. More specifically, we ask
the following question: how do properties of the pre-training distribution
affect the robustness of a fine-tuned model? The properties we explore include
the label space, label semantics, image diversity, data domains, and data
quantity of the pre-training distribution. We find that the primary factor
influencing downstream effective robustness (Taori et al., 2020) is data
quantity, while other factors have limited significance. For example, reducing
the number of ImageNet pre-training classes by 4x while increasing the number
of images per class by 4x (that is, keeping total data quantity fixed) does not
impact the robustness of fine-tuned models. We demonstrate our findings on
pre-training distributions drawn from various natural and synthetic data
sources, primarily using the iWildCam-WILDS distribution shift as a test for
downstream robustness.
Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent
The capacity of neural networks like the widely adopted transformer is known
to be very high. Evidence is emerging that they learn successfully due to
inductive bias in the training routine, typically a variant of gradient descent
(GD). To better understand this bias, we study the tendency for transformer
parameters to grow in magnitude (ℓ2 norm) during training, and its
implications for the emergent representations within self attention layers.
Empirically, we document norm growth in the training of transformer language
models, including T5 during its pretraining. As the parameters grow in
magnitude, we prove that the network approximates a discretized network with
saturated activation functions. Such "saturated" networks are known to have a
reduced capacity compared to the full network family that can be described in
terms of formal languages and automata. Our results suggest saturation is a new
characterization of an inductive bias implicit in GD of particular interest for
NLP. We leverage the emergent discrete structure in a saturated transformer to
analyze the role of different attention heads, finding that some focus locally
on a small number of positions, while other heads compute global averages,
allowing counting. We believe understanding the interplay between these two
capabilities may shed further light on the structure of computation within
large transformers.
Comment: To appear at EMNLP 202
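The saturation idea in the abstract above can be illustrated with a minimal sketch. It assumes that multiplying the attention scores by a growing scalar is a reasonable stand-in for parameter norm growth; the function names and the toy scores are hypothetical and not taken from the paper. As the scale grows, softmax attention approaches a uniform distribution over the highest-scoring positions, which is either a sharp local focus (one maximum) or an average over several positions (ties).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_attention(scores, scale):
    """Attention weights after multiplying every score by `scale`.

    Assumption: the scalar `scale` stands in for growing parameter norm.
    """
    return softmax([scale * s for s in scores])


def saturated_attention(scores):
    """Limiting case: uniform weight over all positions with the maximal score."""
    m = max(scores)
    hits = [1.0 if s == m else 0.0 for s in scores]
    count = sum(hits)
    return [h / count for h in hits]


# Hypothetical attention scores with a tie at positions 1 and 2.
scores = [1.0, 3.0, 3.0, 0.5]

for scale in (1, 10, 100):
    weights = scaled_attention(scores, scale)
    print(scale, [round(w, 3) for w in weights])

print("limit", saturated_attention(scores))
```

With a tie in the scores, the limiting weights average over the tied positions, which is the kind of uniform averaging the abstract associates with counting.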
Neural Priming for Sample-Efficient Adaptation
We propose Neural Priming, a technique for adapting large pretrained models
to distribution shifts and downstream tasks given few or no labeled examples.
Presented with class names or unlabeled test samples, Neural Priming enables
the model to recall and condition its parameters on relevant data seen
throughout pretraining, thereby priming it for the test distribution. Neural
Priming can be performed at test time, even for pretraining datasets as large
as LAION-2B. Performing lightweight updates on the recalled data significantly
improves accuracy across a variety of distribution shift and transfer learning
benchmarks. Concretely, in the zero-shot setting, we see a 2.45% improvement in
accuracy on ImageNet and a 3.81% accuracy improvement on average across standard
transfer learning benchmarks. Further, using our test time inference scheme, we
see a 1.41% accuracy improvement on ImageNetV2. These results demonstrate the
effectiveness of Neural Priming in addressing the common challenge of limited
labeled data and changing distributions. Code is available at
github.com/RAIVNLab/neural-priming.
Comment: 18 pages, 8 figures, 9 tables
Matryoshka Representation Learning
Learned representations are a central component in modern ML systems, serving
a multitude of downstream tasks. When training such representations, it is
often the case that computational and statistical constraints for each
downstream task are unknown. In this context, rigid, fixed-capacity
representations can be either over- or under-accommodating to the task at hand.
This leads us to ask: can we design a flexible representation that can adapt to
multiple downstream tasks with varying computational resources? Our main
contribution is Matryoshka Representation Learning (MRL) which encodes
information at different granularities and allows a single embedding to adapt
to the computational constraints of downstream tasks. MRL minimally modifies
existing representation learning pipelines and imposes no additional cost
during inference and deployment. MRL learns coarse-to-fine representations that
are at least as accurate and rich as independently trained low-dimensional
representations. The flexibility within the learned Matryoshka Representations
offers: (a) up to 14x smaller embedding size for ImageNet-1K classification at
the same level of accuracy; (b) up to 14x real-world speed-ups for large-scale
retrieval on ImageNet-1K and 4K; and (c) up to 2% accuracy improvements for
long-tail few-shot classification, all while being as robust as the original
representations. Finally, we show that MRL extends seamlessly to web-scale
datasets (ImageNet, JFT) across various modalities -- vision (ViT, ResNet),
vision + language (ALIGN) and language (BERT). MRL code and pretrained models
are open-sourced at https://github.com/RAIVNLab/MRL.
Comment: 35 pages, 12 figures. NeurIPS 2022 camera ready publication
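As a rough illustration of how a single embedding might adapt to different compute budgets, the sketch below assumes that the coarse representation is the leading prefix of the full embedding and that truncated prefixes are re-normalized before comparison. The helper names and the toy 8-dimensional vectors are hypothetical; this is not the released MRL code.

```python
import math


def truncate(embedding, dim):
    """Keep the first `dim` coordinates and rescale to unit length.

    Assumption: coarser representations are prefixes of the full embedding.
    """
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix)) or 1.0
    return [x / norm for x in prefix]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def nearest(query, database, dim):
    """Index of the database item most similar to `query` using `dim` dimensions."""
    q = truncate(query, dim)
    sims = [dot(q, truncate(item, dim)) for item in database]
    return max(range(len(database)), key=sims.__getitem__)


# Hypothetical toy embeddings (8 dimensions each).
database = [
    [1.0, 0.0, 0.2, 0.1, 0.0, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.1, 0.2, 0.1, 0.0, 0.0, 0.1],
]
query = [0.9, 0.1, 0.2, 0.0, 0.0, 0.1, 0.0, 0.0]

# A cheap search over 2 dimensions and a full search over 8 dimensions.
print(nearest(query, database, dim=2), nearest(query, database, dim=8))
```

In this toy case the cheap 2-dimensional search and the full 8-dimensional search pick the same item; whether that holds in practice depends on how the embedding was trained.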
DataComp: In search of the next generation of multimodal datasets
Multimodal datasets are a critical component in recent breakthroughs such as
Stable Diffusion and GPT-4, yet their design does not receive the same research
attention as model architectures or training algorithms. To address this
shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset
experiments centered around a new candidate pool of 12.8 billion image-text
pairs from Common Crawl. Participants in our benchmark design new filtering
techniques or curate new data sources and then evaluate their new dataset by
running our standardized CLIP training code and testing the resulting model on
38 downstream test sets. Our benchmark consists of multiple compute scales
spanning four orders of magnitude, which enables the study of scaling trends
and makes the benchmark accessible to researchers with varying resources. Our
baseline experiments show that the DataComp workflow leads to better training
sets. In particular, our best baseline, DataComp-1B, enables training a CLIP
ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming
OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training
procedure and compute. We release DataComp and all accompanying code at
www.datacomp.ai.
Improving shape deformation in unsupervised image-to-image translation
Unsupervised image-to-image translation techniques are able to map local texture between two domains, but they are typically unsuccessful when the domains require larger shape change. Inspired by semantic segmentation, we introduce a discriminator with dilated convolutions that is able to use information from across the entire image to train a more context-aware generator. This is coupled with a multi-scale perceptual loss that is better able to represent error in the underlying shape of objects. We demonstrate that this design is more capable of representing shape deformation in a challenging toy dataset, as well as in complex mappings with significant dataset variation between humans, dolls, and anime faces, and between cats and dogs.
© Springer Nature Switzerland AG 2018